Abstract

Background: The accurate definition of organs at risk (OARs) is required to fully exploit the benefits of 
intensity-modulated radiotherapy (IMRT) for head and neck cancer. However, manual delineation is 
time-consuming and there is considerable inter-observer variability. This is pertinent as function-sparing and 
adaptive IMRT have increased the number and frequency of delineation of OARs. We evaluated accuracy and 
potential time-saving of Smart Probabilistic Image Contouring Engine (SPICE) automatic segmentation to 
define OARs for salivary-, swallowing- and cochlea-sparing IMRT. 

Methods: Five clinicians recorded the time to delineate five organs at risk (parotid glands, submandibular 
glands, larynx, pharyngeal constrictor muscles and cochleae) for each of 10 CT scans. SPICE was then used to 
define these structures. The acceptability of SPICE contours was initially determined by visual inspection and 
the total time to modify them recorded per scan. The Simultaneous Truth and Performance Level Estimation 
(STAPLE) algorithm created a reference standard from all clinician contours. Clinician, SPICE and modified 
contours were compared against STAPLE by the Dice similarity coefficient (DSC) and mean/maximum distance 
to agreement (DTA). 

Results: For all investigated structures, SPICE contours were less accurate than manual contours. However, for 
parotid/submandibular glands they were acceptable (median DSC: 0.79/0.80; mean, maximum DTA: 1.5 mm, 
14.8 mm/0.6 mm, 5.7 mm). Modified SPICE contours were also less accurate than manual contours. The 
utilisation of SPICE did not save time or improve efficiency.

Conclusions: Improvements in accuracy of automatic segmentation for head and neck OARs would be 
worthwhile and are required before its routine clinical implementation. 

Background

The accurate definition of organs at risk (OARs) is required 
to fully exploit the benefits of intensity-modulated radio- 
therapy (IMRT) for head and neck cancer [1]. However, 
manual delineation is time-consuming [2]. There is also
considerable inter-observer variability [3-6], which can
result in significant differences in radiation dose to OARs [4].
This has implications for evaluation of radiotherapy plans,
interpretation of radiation effects and meaningful comparisons
between treatments. Standardisation is improved by
the use of contouring guidelines, multimodality imaging 
and consensus between experts, but variation in organ de- 
lineation remains [3,5,7]. This is of pressing importance 
with the introduction of both function-sparing and adaptive 
IMRT, where number and frequency of delineation of 
OARs are increased. 

Following head and neck radiotherapy, adverse late ef- 
fects are highly prevalent and these impact on both 
organ function and more general domains of well-being, 
such as physical, mental and social health [8]. Radiation-induced
xerostomia is the most commonly reported grade ≥2 late side effect
and can result in difficulties with speech and swallowing, and in
dental caries [9-11]. Saliva is produced by the major (parotid,
submandibular and sublingual) and minor (soft palate, lips, cheeks)
salivary glands [12]. The parotid-sparing intensity-modulated versus
conventional radiotherapy in head and neck cancer (PARSPORT) trial
demonstrated that the incidence of grade ≥2 xerostomia one year after
treatment was significantly reduced with parotid-sparing IMRT compared
with 3D-conformal radiotherapy (38% versus 74%) [9]. One parotid
gland should be spared to a mean dose of less than 20 Gy,
or both glands to less than 25 Gy [13]. For the submandibular
gland, relatively modest reductions in dose (to less
than 35 Gy) may be of benefit [13].

Swallowing dysfunction is seen in up to half of patients 
treated with definitive synchronous chemo-radiotherapy
and is the most common late grade >3 toxicity; the inci- 
dence has increased with intensification of treatment in- 
cluding addition of chemotherapy or altered fractionation 
[14-16]. This adversely affects quality of life, probably to 
an even greater extent than xerostomia [8,17-20]. The 
mean radiation doses to the pharyngeal constrictor mus- 
cles and supraglottic larynx are significantly associated 
with late dysphagia [19,21-27]. The volumes of the larynx
and pharyngeal constrictor muscles that receive a radiation
dose >60 Gy (and where possible >50 Gy) should be
minimised [28].

Permanent and predominantly high frequency sensori- 
neural hearing loss may occur in 40-60% of patients who 
receive radiotherapy to areas such as the nasopharynx, 
para-nasal sinuses and parotid bed [29-31]. This is asso- 
ciated with psychological and cognitive morbidity [32]. 
The mean dose to the cochlea should be limited to <45Gy 
(or more conservatively <35Gy); and when combined with 
cisplatin, strictly limited [33]. 

Significant anatomic changes and alteration in dose to 
target volumes and OARs may occur during a course of 
head and neck radiotherapy [34-37]. A standard way
to detect inter-fraction variation is volumetric imaging
with kilovoltage (kV) cone beam computed
tomography (CT). Typically these images are
superimposed on the planning CT scan using rigid co- 
registration. However, this only allows qualitative com- 
parison of similarity in six degrees of freedom, which 
may not be adequate if the shapes or relative position of 
target organs and OARs have changed. A potential solu- 
tion for head and neck structures is the use of automatic 
segmentation where the planning CT scan and manual 
contours serve as an atlas and are mapped to the re- 
planning or cone beam CT scan using a process of de- 
formable registration and voxel-matching [36,38-41]. 
This would facilitate calculation of changes in doses to
the target volumes and OARs [42], information that
could be used to determine whether adaptive re-planning
is required [34,43-45].

Smart Probabilistic Image Contouring Engine (SPICE) is 
an automated commercially available algorithm, which 
combines an atlas-based and model-based approach to seg- 
mentation of head and neck lymph node levels and OARs 
[46]. The atlas was initially derived from expert 'ground
truth' contours. The automatic segmentation process employs
multiple steps of deformable image registration. First, a
low-dimensional non-rigid transformation maps the model
landmarks (or mean organ positions) into the image, which
accounts for any large displacements (atlas-based step).
Second, there is density-based registration, where each voxel
is included in or excluded from a structure depending on its
intensity (grey-scale step); i.e., functionality is limited to CT
scans. Third, a model-based segmentation approach is applied,
in which organ models ('meshes') that have been created
from averaged manual expert segmentations adapt and refine
the structure (shape model-based step). This mesh
evolution can be considered as being 'driven by the grey-scale
and constrained by the shape model' [47].

This study aims to evaluate accuracy and time-saving 
of SPICE to define OARs for salivary-, swallowing- and 
cochlea-sparing IMRT. 

Methods 

Ten radiotherapy planning CT scans were selected where 
the OARs of interest were not distorted by tumour or 
artefact (treatment planning system, Pinnacle 3 version 
9.4). Five clinicians (four Consultants/Attending Physi- 
cians and one Fellow) recorded for each scan the time to 
manually delineate the parotid and submandibular glands, 
larynx (supraglottic and glottic larynx defined as one 
structure), pharyngeal constrictor muscles (superior, mid- 
dle, inferior pharyngeal constrictor muscles and cricophar- 
yngeus muscle defined as one structure) and cochleae 
according to a locally agreed protocol based on published
guidelines ('manual' contours) [14,48,49]. SPICE was then
used to define these structures ('SPICE' contours). Each
clinician determined by visual inspection the acceptability
of SPICE contours for each structure and recorded the total time to
modify these for each scan ('modified SPICE' contours).
The modified SPICE contours represent the utilisation of 
SPICE in clinical practice (clinician review and modifica- 
tion). These also demonstrate introduction of bias by 
automatic segmentation (in the absence of bias, modified 
and manual contours should ideally match). 

The Simultaneous Truth and Performance Level Esti- 
mation (STAPLE) algorithm employs a probability map to 
create a 'best fit' from a collection of contours (Figure 1) 
[50]. The STAPLE algorithm created a reference standard 
from all clinician manual contours ('STAPLE' contours). 
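
The estimation behind STAPLE can be illustrated with a minimal sketch, assuming binary masks flattened to lists of 0/1 voxel labels and a fixed global prior. This is not the implementation used in the study: the function name `staple_binary`, the starting values of 0.9 and the toy data are all hypothetical. The sketch alternates between estimating the probability that each voxel belongs to the reference structure and re-estimating each observer's sensitivity and specificity.

```python
def staple_binary(decisions, iterations=50):
    """Simplified binary STAPLE (expectation-maximisation) sketch.

    decisions[j][i] is 1 if observer j included voxel i, else 0.
    Returns (w, sens, spec), where w[i] is the estimated probability
    that voxel i belongs to the reference structure.
    """
    n_obs, n_vox = len(decisions), len(decisions[0])
    # Assumption: fixed prior = overall fraction of voxels marked as structure.
    prior = sum(sum(d) for d in decisions) / (n_obs * n_vox)
    sens = [0.9] * n_obs  # hypothetical starting sensitivity per observer
    spec = [0.9] * n_obs  # hypothetical starting specificity per observer
    w = [prior] * n_vox
    for _ in range(iterations):
        # E-step: probability that each voxel is truly inside the structure.
        for i in range(n_vox):
            a, b = prior, 1.0 - prior
            for j in range(n_obs):
                if decisions[j][i]:
                    a *= sens[j]
                    b *= 1.0 - spec[j]
                else:
                    a *= 1.0 - sens[j]
                    b *= spec[j]
            w[i] = a / (a + b) if a + b > 0 else 0.0
        # M-step: update each observer's performance against the estimate.
        total_in = sum(w)
        total_out = n_vox - total_in
        for j in range(n_obs):
            if total_in > 0:
                sens[j] = sum(w[i] for i in range(n_vox) if decisions[j][i]) / total_in
            if total_out > 0:
                spec[j] = sum(1.0 - w[i] for i in range(n_vox) if not decisions[j][i]) / total_out
    return w, sens, spec


# Hypothetical toy data: three observers agree, one over-contours by one
# voxel and one under-contours by one voxel.
exact = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
over = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
under = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
w, sens, spec = staple_binary([exact, exact, exact, over, under])
reference = [1 if p >= 0.5 else 0 for p in w]  # thresholded 'best fit'
```

Thresholding the probability map at 0.5 is one common way to obtain a single reference contour; the threshold used in the study is not stated.
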

Figure 1 Right parotid gland defined by five manual (multiple
colours) and one STAPLE (yellow) contour (one transverse CT
slice shown).

The manual, SPICE and modified SPICE contours were
compared to STAPLE by the Dice similarity coefficient (DSC)
and mean/maximum distance to agreement (DTA). The
DSC is a statistical measure of spatial overlap between
two structures. It is defined as 2 × intersection volume /
sum of the two volumes, and normalises the degree of intersection
from 0 (no overlap) to 1 (perfect overlap), with
good agreement defined as >0.7-0.8 [41,51,52]. DTA is a
geometrical parameter that measures the per voxel shortest
distance from the surface of one structure to another
(ideal = 0 mm). Paired structures (parotid glands, submandibular
glands and cochleae) were considered together.
For the parotid and submandibular glands, SPICE generated
three contours ('1', '2' or '3'), which were each based
on different 'ground truth' data [53]. Comparisons between
these and STAPLE for all 10 patients were made to
determine the most accurate, for subsequent use and
evaluation. The study was conducted with appropriate
local R&D approval.
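
The two measures described above can be sketched as follows, assuming each structure is represented as a set of integer voxel coordinates. The function names, the 6-neighbour definition of a surface voxel and the one-directional DTA (from the test contour to the reference) are assumptions made for illustration; the exact DTA variant used in the study is not stated.

```python
from itertools import product
from math import sqrt

# Offsets to the six face-adjacent neighbours of a voxel.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(a, b):
    """Dice similarity coefficient: 2 x intersection / sum of volumes."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def surface(voxels):
    """Voxels with at least one face neighbour outside the structure."""
    return {
        v for v in voxels
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in voxels
               for dx, dy, dz in NEIGHBOURS)
    }


def distance_to_agreement(test, reference, spacing=(1.0, 1.0, 1.0)):
    """Mean and maximum shortest surface-to-surface distance (in mm)."""
    s_test, s_ref = surface(test), surface(reference)
    if not s_test or not s_ref:
        raise ValueError("both structures must be non-empty")
    dists = [
        min(sqrt(sum(((p[k] - q[k]) * spacing[k]) ** 2 for k in range(3)))
            for q in s_ref)
        for p in s_test
    ]
    return sum(dists) / len(dists), max(dists)


def cube(origin, size):
    """Hypothetical test structure: a solid cube of voxels."""
    ox, oy, oz = origin
    return {(ox + i, oy + j, oz + k) for i, j, k in product(range(size), repeat=3)}


reference = cube((0, 0, 0), 4)
shifted = cube((1, 0, 0), 4)  # same cube displaced by one voxel
dsc = dice(reference, shifted)  # 48 shared voxels of 64 + 64 gives 0.75
mean_dta, max_dta = distance_to_agreement(reference, shifted)
```

The brute-force nearest-surface search is adequate for small examples only; a distance transform would be the usual choice for full CT volumes.
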

Statistical comparisons using multiple linear regression
analysis (to control for possible individual patient/scan
or clinician confounding factors) were made between
mean values of all metrics for: SPICE against STAPLE
versus manual against STAPLE (to determine the accuracy
of SPICE); and modified SPICE against STAPLE versus
manual against STAPLE (to determine the accuracy
of modified SPICE i.e., the utilisation of SPICE). As a
further measure of accuracy, SPICE was compared with
the most discordant clinician contours (determined
against STAPLE and by ranking of clinicians) for each
structure measured by DSC and DTA, using the Wilcoxon
signed rank test. The total times to manually delineate versus
modify SPICE contours for all structures and clinicians
were compared using Student's paired t-test (to determine
efficiency in the utilisation of SPICE). Significance
was assessed at the p < 0.05 level.
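
As an illustration of the paired comparison of contouring times, the test statistic can be sketched as below. The timings are hypothetical placeholders, not study data, and the function name `paired_t_statistic` is an assumption; only the statistic and degrees of freedom are computed, since the p-value requires the Student's t distribution.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for paired samples (second - first)."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [b - a for a, b in zip(first, second)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs))), len(diffs) - 1


# Hypothetical per-scan times in minutes (illustrative only).
manual_minutes = [1.0, 2.0, 3.0, 4.0]
modified_minutes = [2.0, 4.0, 5.0, 4.0]
t_value, dof = paired_t_statistic(manual_minutes, modified_minutes)
```

In practice a statistics package would normally be used; for example, `scipy.stats.ttest_rel` performs the same paired test and also returns the p-value.
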



Results 

Accuracy of SPICE 

SPICE submandibular gland '1' and parotid gland '2' contours
demonstrated best concordance with STAPLE
(Table 1) and were used in subsequent comparisons.

The mean DSCs were significantly reduced for SPICE 
contours compared with manual for all structures 
(Figure 2). All SPICE contours were inferior to the most
discordant manual contours (Figure 2). However, for SPICE
contours of the parotid and submandibular glands, the
respective median and interquartile ranges for DSCs were
0.79 (0.74, 0.83) and 0.80 (0.70, 0.85), suggesting acceptability
for these structures. The mean and maximum
DTAs for SPICE contours and manual were similar for 
parotid glands and cochleae but statistically significantly 
worse for submandibular glands, larynx and pharyngeal 
constrictor muscles (Figures 3 and 4). Similarly, except for
the parotid glands and cochleae, the mean and maximum DTAs
of the SPICE contours were inferior to those of the most discordant
clinician manual contours. However, for submandibular
glands, the respective median and interquartile ranges for
mean and maximum DTAs were relatively minor: 0.6 mm
(0.4-1.0 mm) and 5.6 mm (4.7-8.2 mm).

Utilisation of SPICE 

The total proportions of SPICE contours determined by 
visual inspection not to require alteration were: parotid 
glands (17%), submandibular glands (41%), larynx (8%), 
pharyngeal constrictor muscles (4%), and cochleae (28%). 
The mean DSCs were significantly reduced for modified 
SPICE contours compared with manual for all structures 
(Figure 5). However, the respective median and interquartile
ranges for modified SPICE DSCs for parotid glands,
submandibular glands and larynx were 0.85 (0.83, 0.86),
0.85 (0.82, 0.87) and 0.76 (0.72, 0.82), which represented
good agreement. The mean and maximum DTAs for
modified SPICE contours compared with manual were
similar for the pharyngeal constrictor muscles and cochleae
but significantly worse for parotid glands, submandibular
glands and larynx (Figures 6 and 7). For these
three structures, the respective median and interquartile
ranges for the mean/maximum DTAs were 1.2 mm
(0.8-1.7 mm)/10.6 mm (8.0-14.8 mm), 0.4 mm
(0.2-0.7 mm)/4.8 mm (4.0-5.9 mm) and 1.0 mm
(0.6-1.6 mm)/9.3 mm (7.6-10.2 mm), representing relatively
minor differences for submandibular glands.

Table 1 SPICE version 1, 2 and 3 against STAPLE for
definition of parotid and submandibular glands; mean/
median values shown

SPICE contours (n = 20)   DSC         Mean DTA (mm)   Max DTA (mm)

Parotid glands
  1                       0.77/0.79   2.2/1.9         19.0/17.4
  2                       0.78/0.79   1.6/1.5         14.7/14.8
  3                       0.72/0.75   3.3/2.5         21.5/19.8
Submandibular glands
  1                       0.70/0.80   1.5/0.6         7.2/5.7
  2                       0.67/0.78   2.0/0.4         8.2/5.6
  3                       0.64/0.68   3.0/1.0         11.4/8.0

Abbreviations: DSC, Dice similarity coefficient; DTA, distance to agreement; max,
maximum; mm, millimetres; n, number of SPICE 1, 2 or 3 contours for parotid or
submandibular glands (10 CT scans, two of each structure per scan). Parotid
gland '2' and submandibular gland '1' SPICE contours were taken forward for
subsequent investigation.

Figure 2 Dice similarity coefficient - SPICE against STAPLE compared with: (i) all manual contours against STAPLE (left-side graphs); (ii)
individual clinicians' manual contours against STAPLE (right-side graphs, statistical comparisons shown between most discordant clinician
contours against STAPLE versus SPICE against STAPLE) for A. parotid glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor
muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or SPICE contours (for paired organs, two per scan).



Efficiency in utilisation of SPICE 

The respective per scan overall mean times for manual 
and modified SPICE contours were 14.0 and 16.2 mi- 
nutes (difference, 15.7%) (Figure 8). Only one out of five 
clinicians showed a mean reduction in per scan overall 
time to modify SPICE contours compared with manual. 
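
The reported difference follows directly from the two mean times. A short check, assuming the percentage is expressed relative to the manual time (the helper name `relative_change_percent` is hypothetical):

```python
def relative_change_percent(baseline, new):
    """Change from baseline to new, as a percentage of baseline."""
    return (new - baseline) / baseline * 100.0


manual_mean_minutes = 14.0    # per scan, manual contours
modified_mean_minutes = 16.2  # per scan, modified SPICE contours
increase = relative_change_percent(manual_mean_minutes, modified_mean_minutes)
print(f"{increase:.1f}%")  # prints 15.7%
```
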

Discussion 

This study showed that for head and neck OARs: (i) SPICE
contours were less accurate than manual contours, but acceptable
for the definition of parotid and submandibular
glands; (ii) modified SPICE contours remained inferior to
manual contours; and (iii) the utilisation of SPICE compared
with manual delineation did not save time or improve
efficiency.

Figure 3 Mean Distance to Agreement (mm) - SPICE against STAPLE compared with: (i) all manual contours against STAPLE (left-side
graphs); (ii) individual clinicians' manual contours against STAPLE (right-side graphs, statistical comparisons shown between most
discordant clinician contours against STAPLE versus SPICE against STAPLE) for A. parotid glands, B. submandibular glands, C. larynx,
D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or SPICE contours
(for paired organs, two per scan).

Automatic segmentation to define selected head and
neck OARs may reduce inter-observer variability [54,55].
For two CT scans and eight clinicians, Chao et al. compared
manual and modified automatic contours for delineation
of the clinical target volume as well as parotid
glands, spinal cord, brainstem and (for one scan) the
optic apparatus [54]. For the OARs, inter-observer variability
was significantly reduced for modified compared
with manual contours. This was associated with a mean
time saving of 26%-47%, which depended on experience
of the oncologist. In a subsequent study, the ISOgray
atlas-based auto-segmentation algorithm was evaluated
for definition of the brainstem, parotid glands and mandible
[55]. The study was conducted at 2 centres, where
a total of 3 clinicians either manually delineated (2 clinicians,
3 scans each) or modified automated contours (1
clinician, 7 scans); for only one scan were both manual
and modified contours defined. The mean DSCs for all
organs were 0.68 and 0.82 for manual and modified contours,
respectively; and the sensitivity and specificity for
manual versus modified contours were 63%-91% and
60%-80% versus 63%-91% and 89%-98%, respectively. These
results suggested reduced inter-observer variability for
modified contours compared with manual. However,
while demonstration of reduced inter-observer variability
is important, it is not sufficient, because there is potential
introduction of bias and systematic errors.

Figure 4 Maximum Distance to Agreement (mm) - SPICE against STAPLE compared with: (i) all manual contours against STAPLE (left-side
graphs); (ii) individual clinicians' manual contours against STAPLE (right-side graphs, statistical comparisons shown between most
discordant clinician contours against STAPLE versus SPICE against STAPLE) for A. parotid glands, B. submandibular glands, C. larynx, D.
pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or SPICE contours (for
paired organs, two per scan).

The updated Brainlab automated segmentation algorithm,
which employs atlas-based and deformable registration,
was assessed for accuracy of definition of neck
nodal regions and selected head and neck OARs [56]. In
10 'ideal' cases without neck nodes on at least one side,
the ipsilateral parotid gland, spinal cord and brainstem
were contoured; and in 10 cases with neck node
involvement both parotid glands, submandibular glands,
spinal cord, brainstem and mandible were defined. One
clinician manually contoured and then modified the automatic
contours for each scan/patient. The automatic and
modified contours were compared with manual contours
using the DSC as well as mean and maximum DTA. The
spinal cord and mandible contours were not included in
the analysis because the automatic contours did not require
modification, except for mandible in one case. For
the second group of 10 cases, the OARs were considered
together. The authors found that except for spinal cord,
the automatic contours systematically required some
modification, with resultant improvement in DSC and
DTA measures. There was increased efficiency in definition
of OARs, with the mean time falling from 11.2 minutes
for manual contours to 4.5 minutes for modified contours
(60%), and from 16.4 to 6.3 minutes (62%), in the
respective groups. This time-saving is partly due to the
automatic contours for spinal cord, brainstem and mandible
requiring no or little modification.

Figure 5 Dice similarity coefficient - Modified SPICE against STAPLE compared with all manual contours against STAPLE for A. parotid
glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001.
Abbreviations: n, total number of manual or modified SPICE contours (for paired organs, two per scan).

Clinical validation of a multiple-subject atlas-based 
autosegmentation tool was performed by measuring the 
DSC and mean DTA for manual contours (outlined by 
one of 10 clinicians and agreed by an expert panel) and 
modified contours (outlined by one of two clinicians) for 
neck levels, parotid and submandibular glands in 12 patients
[57]. For manual versus automatic contours, the
respective DSC/mean DTA for parotid and submandibular
glands were 0.80/2.3 mm and 0.72/1.6 mm. For manual 
versus modified automatic contours, the respective DSC/ 
mean DTA for parotid and submandibular glands were 
0.81/2.1 mm and 0.77/1.2 mm. 

We found that SPICE automatic contours were less accurate
than manual contours for all investigated
structures, but acceptable for the parotid and submandibular
glands. For the parotid and submandibular glands, the
DSCs were satisfactory [41,52]; for parotid glands, the mean
and maximum DTAs were similar to manual contours and
for submandibular glands, the differences were relatively
minor. The modification of automatic contours improved
accuracy, but the modified contours remained inferior to
manual contours and their use did not result in time-saving.
There are a number of possible
reasons for these findings. First, the processes of automatic
segmentation, both grey-scale and model-based, are limited
by insensitivity to boundary or edge detection [47]. This is
important because the differences in attenuation between
soft tissues are often small and the shapes of organs divergent.
The computer-based algorithms do not account for
nuances in the honed technique of the expert manual contourer.
Second, while there are published delineation guidelines
for OARs, there is no agreed international consensus,
especially for definition of the larynx and pharyngeal constrictor
muscles [14,48]. The SPICE atlas may have been
developed from dissimilar 'ground truth' contours. Where
available, an alternative investigational strategy would be to
adapt the local contouring protocol to that used to define
the atlas contours [58]. Third, to produce tightly conformed
volumes, relatively small alterations in automatic contours
may be required, which are time-consuming. The modification
process is then less efficient than manual delineation,
where techniques such as interpolation between CT slice
levels may be used.

Figure 6 Mean Distance to Agreement (mm) - Modified SPICE against STAPLE compared with all manual contours against STAPLE for

Whether differences between manual, automatic or
modified contours result in clinically relevant alterations in 
measured doses to OARs is uncertain. This will partly de- 
pend on proximity of normal structures to the treatment 
volume and the dose gradient. In this study, the target vol- 
umes were not defined. This may have influenced the low 
percentage of OARs determined by visual inspection not to 
require alteration i.e., clinicians only considered the con- 
formity of automatic contours to normal structures rather 
than clinical relevance or requirement for this. 



This study represents an independent clinical evalu- 
ation of automatic segmentation using SPICE and its 
utilisation for head and neck OARs. It determined the 
accuracy of SPICE by comparison against a reference 
standard created using STAPLE, for five head and neck 
OARs important in function-sparing IMRT. Future work 
should evaluate automatic segmentation in the presence 
of distortion by tumour or artefact e.g., dental amalgam; 
and determine the variation in measured dose to OARs 
between manual, automatic and modified contours. 

Conclusion

For the investigated head and neck OARs, SPICE auto- 
matic segmentations were less accurate than manual 
contours. However, these were acceptable for the defin- 
ition of parotid and submandibular glands. The modifi- 
cation of SPICE contours improved accuracy, but these 
remained inferior to manual contours and the process 
did not result in time-saving. Improvements in auto- 
matic segmentation of head and neck OARs would be 
worthwhile and are required before routine clinical 
implementation. 